This document explains the methodology of the correlations analysis for San Jose social distancing compliance and summarizes key results. It uses data on social distancing through 5/10/2020.
library(tidyverse)
library(plotly)
library(sf)
library(mapview)
library(tigris)
library(censusapi)
library(leaflet)
library(lehdr)
library(usmap)
options(
tigris_class = "sf",
tigris_use_cache = TRUE
)
Block groups with higher percentages of households earning over $125,000 annually tend to have greater increases in staying at home following the shelter-in-place order.
Block groups with higher percentages of individuals with a degree at the Associate’s level or higher also tend to have greater increases in staying at home; however, some of this effect may be related to the correlation between education and income.
Block groups with higher percentages of individuals that are Hispanic/Latino tend to have smaller increases in staying home; however, the effect of Hispanic/Latino percentage disappears when considered in combination with other demographic variables.
Of the demographic variables that we considered, the combination of income, education, Asian population, child population, and young adult population of a census block group provides the greatest predictive ability for the change in staying at home in that block group. Higher block group averages of income and education, as well as larger Asian and child populations, are associated with larger increases in staying at home; in contrast, a larger young adult population is associated with smaller increases in staying at home.
The data used for social distancing compliance comes from Safegraph’s social distancing dataset. This dataset tracks individuals’ cell phone GPS location. We have access to the aggregate behavior of movement that is at the census block group scale. In this analysis, we use the “completely at home” field, which Safegraph defines as a count of devices that did not leave their usual nighttime location (see documentation at https://docs.safegraph.com/docs/social-distancing-metrics). For each census block group in San Jose, we calculate the average percent of devices completely at home on weekdays since the start of the Bay Area shelter-in-place order (3/16/2020), as well as the percent of devices completely at home on weekdays during the months of January and February 2020, prior to the shelter-in-place order and widespread COVID-19 concerns. From these results, we obtain the percent of devices leaving home during these time periods.
In our analysis, we quantify the correlations between percentage of devices leaving their respective homes before and after the shelter in-place-order was instated and census block group averages of various demographic variables, including income, age, language ability, race, ethnicity, education level, vehicle ownership, occupants per room in a household, sex of workers, and high speed internet access. We choose to analyze these demographic variables based on our literature review of existing research on correlations with social distancing and our hypotheses of potentially relevant factors in San Jose and the Bay Area. Information on these demographic variables at the census block group level is from the American Community Survey 2014-2018 5 year summary data. We also quantify the correlations between these demographic variables and the change in percent of devices staying completely at home after the shelter-in-place order relative to before the order. This latter metric would suggest the ability of a community to alter their behavior to comply with the shelter-in-place order.
Income is a strong predictor of percentage of devices leaving home during the shelter-in-place order period. We consider different income thresholds, and conclude that the percentage of households in a block group earning over $125,000 annually is the best predictor. In the graph below, we plot the percentage of devices leaving home with the percentage of households making over $125,000, with each block group represented as a point on the graph. The best fit linear trendline is shown in orange. The slider on the bottom switches the data between percentage of devices leaving home before the shelter-in-place order to after the shelter-in-place order.
# load data
sj_dem_distancing_pre_post <- readRDS("/Users/simonespeizer/Documents/2020 Spring Quarter/CEE 218Z/covid19/Simone_Speizer/sj_socialdistancing_demdata_prepostdifs_manyvars.rds")
# combine the data so that plots can be animated with trendlines
# get the before shelter in place data
sj_dem_distancing_pre_shelter <- sj_dem_distancing_pre_post %>% dplyr::select(`% not completely at home pre shelter`, blockgroup)
sj_dem_distancing_pre_shelter[is.na(sj_dem_distancing_pre_shelter)] <- 0
# relabel column
colnames(sj_dem_distancing_pre_shelter)[1] <- "% leaving home"
# add back demographic variables
sj_dem_distancing_pre_shelter <- sj_dem_distancing_pre_shelter %>% left_join(sj_dem_distancing_pre_post)
# get trendlines
sj_dem_distancing_pre_shelter <- sj_dem_distancing_pre_shelter %>%
mutate(
income_trendline = fitted(lm((sj_dem_distancing_pre_shelter)$`% leaving home` ~ (sj_dem_distancing_pre_shelter)$`% over 125,000`)),
hispanic_trendline = fitted(lm((sj_dem_distancing_pre_shelter)$`% leaving home` ~ (sj_dem_distancing_pre_shelter)$`% hispanic/latino`)),
educ_trendline = fitted(lm((sj_dem_distancing_pre_shelter)$`% leaving home` ~ (sj_dem_distancing_pre_shelter)$`percent associates or higher`))) %>%
cbind(`Before or After Shelter-in-Place` = "Before shelter-in-place")
# repeat for post shelter in place
sj_dem_distancing_post_shelter <- sj_dem_distancing_pre_post %>% dplyr::select(`% not completely at home`, blockgroup)
sj_dem_distancing_post_shelter[is.na(sj_dem_distancing_post_shelter)] <- 0
# relabel column
colnames(sj_dem_distancing_post_shelter)[1] <- "% leaving home"
# add back demographic variables
sj_dem_distancing_post_shelter <- sj_dem_distancing_post_shelter %>% left_join(sj_dem_distancing_pre_post)
# get trendlines
sj_dem_distancing_post_shelter <- sj_dem_distancing_post_shelter %>%
mutate(
income_trendline = fitted(lm((sj_dem_distancing_post_shelter)$`% leaving home` ~ (sj_dem_distancing_post_shelter)$`% over 125,000`)),
hispanic_trendline = fitted(lm((sj_dem_distancing_post_shelter)$`% leaving home` ~ (sj_dem_distancing_post_shelter)$`% hispanic/latino`)),
educ_trendline = fitted(lm((sj_dem_distancing_post_shelter)$`% leaving home` ~ (sj_dem_distancing_post_shelter)$`percent associates or higher`))) %>%
cbind(`Before or After Shelter-in-Place` = "After shelter-in-place")
# combine them
sj_dem_distancing_pre_post_separate <- rbind(sj_dem_distancing_pre_shelter, sj_dem_distancing_post_shelter)
# convert the before/after column to factor so it shows up correctly on the plots
sj_dem_distancing_pre_post_separate$`Before or After Shelter-in-Place` <- factor(sj_dem_distancing_pre_post_separate$`Before or After Shelter-in-Place`, levels = c("Before shelter-in-place", "After shelter-in-place"))
fig_income <-
plot_ly (sj_dem_distancing_pre_post_separate) %>%
add_trace(
x = ~`% over 125,000`,
y = ~`% leaving home`,
frame = ~`Before or After Shelter-in-Place`,
type = 'scatter',
mode = 'markers',
showlegend = F
) %>%
add_trace(
x = ~`% over 125,000`,
y = ~income_trendline,
type = 'scatter',
mode = 'lines',
line = list(size = 5, color = 'rgba(255, 165, 0, 1)'),
frame = ~`Before or After Shelter-in-Place`,
showlegend = F
) %>%
animation_button(visible = F) %>%
animation_slider(
pad = list(t =75),
currentvalue = list(visible=F)
) %>%
layout(xaxis = list(title = 'Percentage of households making over $125,000'), yaxis = list(title = 'Percentage of devices leaving home'), margin = list(l = 75,r = 75))
fig_income
From this figure, we see that during the shelter-in-place order period a higher percentage of households making over $125,000 in a block group correlates with fewer devices leaving the home in that block group. This trend is the opposite of that observed prior to the shelter-in-place order, suggesting that block groups with a greater percentage of households of higher income were more able to adjust their behavior to comport with the shelter-in-place order.
To better assess this relative change in behavior, we fit a linear model to the change in devices staying completely at home after the shelter-in-place order (relative to before the order) and the percentage of households earning more than $125,000. The results of that model, including the coefficient on income (the slope of the linear fit) and the R-squared value, are shown below.
Coefficient:
income_125_model_dif <- lm(`% increase in staying completely home` ~ `% over 125,000`, sj_dem_distancing_pre_post)
print(summary.lm(income_125_model_dif)$coefficients, digits = 4, signif.stars=TRUE)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.3871 0.75357 17.76 1.840e-56
## `% over 125,000` 0.3068 0.01665 18.43 9.009e-60
R-squared:
print(summary.lm(income_125_model_dif)$r.squared, digits = 4)
## [1] 0.3746
From the coefficient value, we see that as the percentage of households with incomes over $125,000 in a block group increases by 1%, the difference between the percentage of devices staying completely at home after the shelter-in-place order and the percentage completely at home before the order increases by about 0.31%. The R-squared value assesses the degree to which this model accurately predicts the variation in change in devices staying completely at home observed in the data; the result of 0.37 indicates that the linear fit with income predicts about 37% of the observed variation. The low p value indicates that these results are statistically significant. This is a relatively strong prediction, even without examining the effect of other demographic variables.
We next consider correlations between the racial and ethnic makeup of a block group and its social distancing compliance. We focus our analysis on Hispanic/Latino population, Asian population, and White population, as these are the main races/ethnicities in San Jose. Note that these categories are defined in accordance with the US Census descriptions and as such the race variables refer to an individual’s origins.
We find that percentage of residents of a block group that are White does not correlate with change in leaving home behavior of that block group (R-squared 0.005). Higher percentage of residents that are Asian has a weak positive correlation with fewer devices leaving home following the shelter-in-place order (coefficient 0.14, R-squared 0.1). In contrast, higher percentage of residents that are Hispanic/Latino in a block group correlates with more devices leaving home following the shelter-in-place order, as shown in the graph below.
fig_hisp <-
plot_ly (sj_dem_distancing_pre_post_separate) %>%
add_trace(
x = ~`% hispanic/latino`,
y = ~`% leaving home`,
frame = ~`Before or After Shelter-in-Place`,
type = 'scatter',
mode = 'markers',
showlegend = F
) %>%
add_trace(
x = ~`% hispanic/latino`,
y = ~hispanic_trendline,
type = 'scatter',
mode = 'lines',
line = list(size = 5, color = 'rgba(255, 165, 0, 1)'),
frame = ~`Before or After Shelter-in-Place`,
showlegend = F
) %>%
animation_button(visible = F) %>%
animation_slider(
pad = list(t =75),
currentvalue = list(visible=F)
) %>%
layout(xaxis = list(title = 'Percent of residents that are Hispanic or Latino', autorange = "reversed"), yaxis = list(title = 'Percent of devices leaving home'), margin = list(l = 75,r = 75))
fig_hisp
The results of the linear model fitting the change in percentage of devices staying completely at home and the percentage of residents that are Hispanic/Latino are shown below.
Coefficient:
hispanic_model_dif <- lm(`% increase in staying completely home` ~ `% hispanic/latino`, sj_dem_distancing_pre_post)
print(summary.lm(hispanic_model_dif)$coefficients, digits = 4, signif.stars=TRUE)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.2654 0.5866 56.71 7.914e-236
## `% hispanic/latino` -0.2278 0.0150 -15.18 6.276e-44
R-squared:
print(summary.lm(hispanic_model_dif)$r.squared, digits = 4)
## [1] 0.289
As the percentage of individuals that are Hispanic/Latino in a block group increases by 1%, the change in percentage of devices staying completely at home decreases by about 0.23%, meaning that a higher percentage of Hispanic/Latino residents is associated with a smaller increase in staying completely at home. This linear model with the Hispanic/Latino population of a block group predicts about 29% of the observed variation in the change in devices staying completely at home, comparable with the linear fit for education that was previously shown. The results are again statistically significant.
However, we hypothesize that the correlation observed here might be related to underlying correlations between Hispanic/Latino population and other demographic variables. To test this, we perform a multiple regression analysis with Hispanic/Latino population, income, education, and Asian population, yielding the following results.
Coefficients:
hispanic_inc_educ_asian_model_dif <- lm(`% increase in staying completely home` ~ `% hispanic/latino` + `% over 125,000` + `percent associates or higher` + `% Asian`, sj_dem_distancing_pre_post)
print(summary.lm(hispanic_inc_educ_asian_model_dif)$coefficients, digits = 4, signif.stars=TRUE)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.83791 2.33245 4.647 4.205e-06
## `% hispanic/latino` -0.01484 0.02609 -0.569 5.696e-01
## `% over 125,000` 0.21030 0.02256 9.320 2.600e-19
## `percent associates or higher` 0.09562 0.03045 3.141 1.775e-03
## `% Asian` 0.07632 0.01580 4.831 1.752e-06
R-squared:
print(summary.lm(hispanic_inc_educ_asian_model_dif)$r.squared, digits = 4)
## [1] 0.439
When accounting for education, income, and the Asian population of a block group, the percentage of residents that are Hispanic/Latino loses its predictive ability–its p value is no longer statistically significant–though all three of the other variables are significant. Income appears to be the key variable, followed by education and Asian population. This combined model predicts 44% of the variation in the change in percentage of devices staying completely at home.
As we have shown, income, education level, and Asian population together are very strong predictors of the change in percentage of devices staying completely at home in a block group. These three variables, when combined with the child and young adult population of a block group, yield a model with the greatest predictive ability for the change in percentage of devices staying completely at home. Though the two age variables are not strong predictors on their own, when included in a multivariable model they do provide additional predictive power beyond that of a model with only income, education level, and Asian population.
Note that this best-predicting model was determined by comparing multiple regression analyses of different combinations of the census variables we include in our analysis, and offers the highest predictive ability that we find using these demographic variables; however, there could be other variables that we do not consider here that may add further explanatory power.
The parameters of the best-predicting model from our analysis are presented below.
Coefficients:
inc_educ_asian_child_yad_model_dif <- lm(`% increase in staying completely home` ~ `% over 125,000` + `percent associates or higher` + `% Asian` + `percent less than 18` + `percent 20-29`, sj_dem_distancing_pre_post)
print(summary.lm(inc_educ_asian_child_yad_model_dif)$coefficients, digits = 4, signif.stars=TRUE)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.85199 1.91126 3.585 3.663e-04
## `% over 125,000` 0.16596 0.02248 7.382 5.652e-13
## `percent associates or higher` 0.13763 0.02204 6.246 8.312e-10
## `% Asian` 0.08802 0.01385 6.357 4.243e-10
## `percent less than 18` 0.21799 0.04854 4.491 8.617e-06
## `percent 20-29` -0.13899 0.04179 -3.326 9.376e-04
R-squared:
print(summary.lm(inc_educ_asian_child_yad_model_dif)$r.squared, digits = 4)
## [1] 0.4861
All five of these variables are significant in this model. Higher income and educational attainment in a block group, higher percentage of residents that are Asian, and higher percentage of residents that are children are all associated with larger increases in percent of devices staying completely at home; higher percentage of residents that are ages 20-29 is associated with a smaller increase in percentage of devices staying completely at home following shelter-in-place. These five variables together predict about 48% of the variation in the change in percentage of devices staying completely at home.
Summarizing, income predicted about 37% of the change in percent of devices staying completely at home, and adding in education raised this prediction to 41%. Including percentage of residents that are Asian boosted the predictive power to 44%, and adding percentage of residents that are younger than 18 and between the ages of 20-29 raised it to 48%.
We briefly summarize the results for other demographic variables that do have some correlation with changes in staying at home, but that are not found to be important in the highest-predicting multiple regression model. These variables include high speed internet access, occupants per room in a household, English language ability, and Spanish language ability.
The analysis on internet access, specifically the percentage of households that have access to high speed internet, was inspired by the paper “Social Distancing, Internet Access and Inequality” by Chiou and Tucker (https://www.nber.org/papers/w26982) that found that the combination of high speed internet access and high income was the key driver of ability to stay at home. We do indeed find a correlation between increase in percentage of devices staying completely at home and percentage of households with broadband such as cable, fiber optic or DSL (coefficient 0.36, p value < 2e-16, R-squared 0.22). However, high speed internet access does not provide any additional information to a model that already incorporates income; including high speed internet access in a regression with income raises the R-squared value by less than 0.01 relative to the R-squared of 0.37 for the model with income alone, and does not provide additional useful information in the multivariable regression model.
Similarly, though the percentage of households that have 1 or fewer occupants per room also correlates with change in devices staying at home (coefficient 0.37, p value < 2e-16, R-squared 0.18), this metric also fails to provide significant additional predictive power over income (R-squared 0.38 for income and occupants per room combined).
Percentage of residents speaking English well provided some, but less, predictive power on its own (coefficient 0.36, p value < 2e-16, R-squared 0.12), but again was not significant in multiple regression analyses that incorporated other demographic variables.
Percentage of residents speaking Spanish did offer some predictive power (coefficient of -0.24, p value < 2e-16, R-squared 0.25) but is a very similar metric to Hispanic/Latino population, and was similarly insignificant when combined with education and income.
Demographic variables we consider that lack strong correlations with changes in staying at home include percentage of residents ages 65 and older (R-squared 0.04), percentage of households with a vehicle available (R-squared 0.08), percentage of workers that are male (R-squared 0.0002), and, as mentioned previously, percentage of residents that are White (R-squared 0.005).
All of the analyses considered here are correlatory, and thus should not be taken to indicate any kind of causal relationship. Our results on shelter-in-place behavior and Hispanic/Latino population exemplify how apparent correlations may disappear when the effects of other variables are taken into account. Thus, though our results suggest that block groups with more residents of higher incomes and higher education levels tend to be those with greater increases in staying at home, they do not explain why this relationship exists nor rule out the possibility of other underlying factors not considered in this analysis driving these trends. For example, block groups with more residents of higher income and education level may also be those with more residents who can work from home, while those with more residents of lower income and education level may be those with more residents who are essential workers. Further analysis is needed to better characterize the relationships that our results identify between demographic variables and social distancing compliance.
Our full analyses can be viewed here https://stanfordfuturebay.github.io/simone_sd_correlations_analysis_sj_01.html.
Contact Simone Speizer (sspeizer@stanford.edu) with questions.